{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ac7038265d6c97972ddde49400702c543552918a"
   },
   "source": [
    "- <a href='#0'>0. Introduction</a>  \n",
    "- <a href='#1'>1. Get the Data</a>\n",
    "- <a href='#2'>2. Check the Data</a>\n",
    "- <a href='#3'> 3. Explore the data</a>\n",
    "    - <a href='#3-1'>3.1 Categorical features</a>\n",
    "    - <a href='#3-2'>3.2 Numerical features</a>\n",
    "    - <a href='#3-3'>3.3 Categorical features by label</a>\n",
    "    - <a href='#3-4'>3.4 Numerical features by label</a>\n",
    "    - <a href='#3-5'>3.5 Correlation Matrix</a>\n",
    "- <a href='#4'> 4. A further exploration on application table</a>\n",
    "    - <a href='#4-1'>4.1 Impute missing values</a>\n",
    "    - <a href='#4-2'>4.2 Create more feature</a>\n",
    "    - <a href='#4-3'>4.3 Train model</a>\n",
    "    - <a href='#4-4'>4.4 Feature importance</a>\n",
    "    - <a href='#4-5'>4.5 Prediction</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "c79c5a2b4f974bf6efd2f7d42dc3a9c92347f9b1"
   },
   "source": [
    "## <a id='0'>0. Introduction</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "c6292a6edfb8a369bc6230035daef588421fd1f2"
   },
   "source": [
    " [Home Credit](http://www.homecredit.net/[](http://) is an international non-bank financial institution founded in 1997 in the Czech Republic. The company operates in 14 countries and focuses on lending primarily to people with little or no credit history. \n",
    "\n",
    "Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--**to predict their clients' repayment abilities.**\n",
    "\n",
    "While Home Credit is currently using various statistical and machine learning methods to make these predictions, **they're challenging Kagglers to help them unlock the full potential of their data**. Doing so will ensure that clients capable of repayment are not rejected and that loans are given with a principal, maturity, and repayment calendar that will empower their clients to be successful.\n",
    "![](http://www.homecredit.net/~/media/Images/H/Home-Credit-Group/image-gallery/full/image-gallery-01-11-2016-b.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "703773ae255bf7da9569d4f2aa720fa865dc15a3"
   },
   "source": [
    "## <a id='1'>1. Get the data</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "_cell_guid": "b1076dfc-b9ad-4769-8c92-a6c4dae69d19",
    "_uuid": "8f2839f25d086af736a60e9eeb907d3b93b6e0e5",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "pd.set_option('display.max_columns', None)\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "%matplotlib inline\n",
    "color = sns.color_palette()\n",
    "import gc\n",
    "import warnings\n",
    "import time\n",
    "warnings.filterwarnings(\"ignore\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "_cell_guid": "79c7e3d0-c299-4dcb-8224-4455121ee9b0",
    "_uuid": "d629ff2d2480ee46fbb7e2d37f6b5fab8052498a",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train = pd.read_csv('../input/application_train.csv')\n",
    "application_test= pd.read_csv('../input/application_test.csv')\n",
    "bureau = pd.read_csv('../input/bureau.csv')\n",
    "bureau_balance = pd.read_csv('../input/bureau_balance.csv')\n",
    "POS_CASH_balance = pd.read_csv('../input/POS_CASH_balance.csv')\n",
    "credit_card_balance = pd.read_csv('../input/credit_card_balance.csv')\n",
    "previous_application = pd.read_csv('../input/previous_application.csv')\n",
    "installments_payments = pd.read_csv('../input/installments_payments.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "10564d0e5600e2fbb0e708277777f6c74c01b86b"
   },
   "source": [
    "This file contains descriptions for the columns in the various data files.\n",
    "\n",
    "<img src=\"https://storage.googleapis.com/kaggle-media/competitions/home-credit/home_credit.png\" width=\"800\"></img>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "_uuid": "c422fa6943df5eeca9c5d0ced9deeede1fa55274",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print('------------main------------')\n",
    "print('application_train:', application_train.shape[0], \"rows and\", application_train.shape[1],'columns')\n",
    "print('application_test:', application_test.shape[0], \"rows and\", application_test.shape[1],'columns')\n",
    "print('      ')\n",
    "print('------------others------------')\n",
    "print('POS_CASH_balance:', POS_CASH_balance.shape[0], \"rows and\", POS_CASH_balance.shape[1],'columns')\n",
    "print('bureau:', bureau.shape[0], \"rows and\", bureau.shape[1],'columns')\n",
    "print('bureau_balance:', bureau_balance.shape[0], \"rows and\", bureau_balance.shape[1],'columns')\n",
    "print('previous_application:', previous_application.shape[0], \"rows and\", previous_application.shape[1],'columns')\n",
    "print('installments_payments:', installments_payments.shape[0], \"rows and\", installments_payments.shape[1],'columns')\n",
    "print('credit_card_balance:', credit_card_balance.shape[0], \"rows and\", credit_card_balance.shape[1],'columns')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "fb9266061d05147086c3508cd435f62458448739"
   },
   "source": [
    "## <a id='2'>2. Check the data</a>\n",
    "### 2.1 application train / test"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "_uuid": "6ed6c90dfa5a7c436b3c768a82a3c103edff7bcc",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "_uuid": "7e5bf10dfb7f98006463db8d3a27641aacd169ae",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "_uuid": "39ba572c974e791660fb298ff88c41db248d5d40",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def find_missing(data):\n",
    "    # number of missing values\n",
    "    count_missing = data.isnull().sum().values\n",
    "    # total records\n",
    "    total = data.shape[0]\n",
    "    # percentage of missing\n",
    "    ratio_missing = count_missing/total\n",
    "    # return a dataframe to show: feature name, # of missing and % of missing\n",
    "    return pd.DataFrame(data={'missing_count':count_missing, 'missing_ratio':ratio_missing}, index=data.columns.values)\n",
    "find_missing(application_train).head(12)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "_uuid": "046b279e0beb50b68ab3b24620e38b1f0841c109",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(application_test).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "e2bb42ed3f618d2edc8e1bd3bf326c0836fb1d4e"
   },
   "source": [
    "### 2.2 POS_CASH_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "_uuid": "54e5206116f944290d2db4352bae0ec7fa933cd0",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "POS_CASH_balance.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "_uuid": "ed0f333a854f54b6f51bff6a0733eccec1cad723",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "POS_CASH_balance.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "_uuid": "2c847477730c786d3132595159b46f450ded4f65",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(POS_CASH_balance).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "001a9744f32d0fd3b638fdbca8a91ce5f353b566"
   },
   "source": [
    "### 2.3 bureau"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "_uuid": "761a7ba7def762a8000e46c69c2c9268dcb29e0f",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bureau.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "_uuid": "e18df4188c7002f37b5ab70a71f2e5ec15f03452",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bureau.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "_uuid": "63feb85dac97e2f3956f5b4838a204351fa8dc4b",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(bureau).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "d73a8a32a87ec68aca1221e27d00c4c793cd012c"
   },
   "source": [
    "### 2.4 bureau_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "_uuid": "487486153a5734c6e38b8d2feb10040968b94722",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bureau_balance.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "_uuid": "2faf8566eaff4dd17f5090676d03932755f981c1",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "bureau_balance.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "_uuid": "47c6f985376ecc7c6ac078c414584d0e976a172e",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(bureau_balance).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "0f51aa6108b91a52b81dd3d6bd820b34b47833a8"
   },
   "source": [
    "### 2.5 credit_card_balance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "_uuid": "ec0e03122e35465cb055685509101364d1a9664b",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "credit_card_balance.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "_uuid": "b63c84aae53d05b9e2ae1191bda8bfeeaabb46be",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "credit_card_balance.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "_uuid": "323c17de8901fdeae5a3a1ba5aacfa1c53838cb0",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(credit_card_balance).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "e905b20bcb0fc11d25193c3f20dfcae47071ac73"
   },
   "source": [
    "### 2.6 previous_application"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "_uuid": "78514694d73a8eeead5a5b8295491330702c723a",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "previous_application.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "_uuid": "75451486f2c96eba5ab29639884d564b43fa6d9e",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "previous_application.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "_uuid": "8d73819eb83abd76f5ecfd2c081b636c4ef18a56",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(previous_application).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "1c57b2202f06436afa0cf674e97e43392c704ba5"
   },
   "source": [
    "### 2.7 installments_payments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "_uuid": "8000062c2beb4e4a8c9e757885b268fdbda9abab",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "installments_payments.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "_uuid": "79ac8ca7fc8cb8ce2e06f2e314a85c8214a7eb66",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "installments_payments.columns.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {
    "_uuid": "3b45cdacce5c22c5e847c77bfca75fa0132204b9",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "find_missing(installments_payments).head(12)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "82cc4fb6e209a6bdf14cd1d85fdb12f91656a2e4"
   },
   "source": [
    "## <a id='3'>3. Explore the data</a>\n",
    "\n",
    "### <a id='3-1'>3.1 Categorical features</a>\n",
    "#### Label"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 85,
   "metadata": {
    "_uuid": "7d68360e304d98e1e2617cf7736a776ff51e9dfb",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def plot_categorical(data, col, size=[8 ,4], xlabel_angle=0, title=''):\n",
    "    '''use this for ploting the count of categorical features'''\n",
    "    plotdata = data[col].value_counts()\n",
    "    plt.figure(figsize = size)\n",
    "    sns.barplot(x = plotdata.index, y=plotdata.values)\n",
    "    plt.title(title)\n",
    "    if xlabel_angle!=0: \n",
    "        plt.xticks(rotation=xlabel_angle)\n",
    "    plt.show()\n",
    "plot_categorical(data=application_train, col='TARGET', size=[8 ,4], xlabel_angle=0, title='train set: label')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "83c069e021dee19afaeb2d001cabe1436ed9ce29"
   },
   "source": [
    "### Occupation Type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {
    "_uuid": "3115a0520301af1c513fff68363ff50e83daf358",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_categorical(data=application_train, col='OCCUPATION_TYPE', size=[12 ,4], xlabel_angle=30, title='Occupation Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "00e62965043ba1ffc8642dd94e6e7d1ec63078ae"
   },
   "source": [
    "#### Gender"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "0f71531f720230047c1eff018e59536d99a77d51"
   },
   "source": [
    "### Income Type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 105,
   "metadata": {
    "_uuid": "1a5cbfb6581f0740571e87b6e206f05c2c953bfc",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_categorical(data=application_train, col='NAME_INCOME_TYPE', size=[12 ,4], xlabel_angle=0, title='Income Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ea53791a616c2e61d40747be844e49d6402593ee"
   },
   "source": [
    "### House Type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 104,
   "metadata": {
    "_uuid": "86ada60873d8666cd897f1843a3f0187ff808381",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_categorical(data=application_train, col='NAME_HOUSING_TYPE', size=[12 ,4], xlabel_angle=0, title='House Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "1731aa39d0bdcdeba6f1d8eca46939f1d351fa21"
   },
   "source": [
    "### <a id='3-2'>3.2 Numerical features</a>\n",
    "#### Credit Amount"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {
    "_uuid": "60a7114d4a94e772c6002bf4affc3134e2efa657",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def plot_numerical(data, col, size=[8, 4], bins=50):\n",
    "    '''use this for ploting the distribution of numercial features'''\n",
    "    plt.figure(figsize=size)\n",
    "    plt.title(\"Distribution of %s\" % col)\n",
    "    sns.distplot(data[col].dropna(), kde=True,bins=bins)\n",
    "    plt.show()\n",
    "plot_numerical(application_train, 'AMT_CREDIT')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "5d94712d8d9e2e1ec792e83945281bc30e7bcc72"
   },
   "source": [
    "#### Annuity Amount"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 135,
   "metadata": {
    "_uuid": "7d60c0bab648e9d3dfe5d19090aa66d25395f1e6",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_numerical(application_train, 'AMT_ANNUITY')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "1381a16a547fdda6bfd33f9c1920127a30631ef1"
   },
   "source": [
    "### Days employed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 136,
   "metadata": {
    "_uuid": "0494defa717079da52dd0e3f6618c8b6a007e62b",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_numerical(application_train, 'DAYS_EMPLOYED')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "664de48c31004cd49417e968560bc2f05912aa23"
   },
   "source": [
    "### <a id='3-3'>3.3 Categorical features by label</a>\n",
    "#### Gender"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "_uuid": "9bf826000f25b111112b2ed7d014a3d7394f9485",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def plot_categorical_bylabel(data, col, size=[12 ,6], xlabel_angle=0, title=''):\n",
    "    '''use it to compare the distribution between label 1 and label 0'''\n",
    "    plt.figure(figsize = size)\n",
    "    l1 = data.loc[data.TARGET==1, col].value_counts()\n",
    "    l0 = data.loc[data.TARGET==0, col].value_counts()\n",
    "    plt.subplot(1,2,1)\n",
    "    sns.barplot(x = l1.index, y=l1.values)\n",
    "    plt.title('Default: '+title)\n",
    "    plt.xticks(rotation=xlabel_angle)\n",
    "    plt.subplot(1,2,2)\n",
    "    sns.barplot(x = l0.index, y=l0.values)\n",
    "    plt.title('Non-default: '+title)\n",
    "    plt.xticks(rotation=xlabel_angle)\n",
    "    plt.show()\n",
    "plot_categorical_bylabel(application_train, 'CODE_GENDER', title='Gender')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "3bcd37429d90bdf9bd21be09b91ae0cb3b94494a"
   },
   "source": [
    "#### Education Type"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 157,
   "metadata": {
    "_uuid": "730489b1a4375b1045fa9437eccf7d2a465e4678",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_categorical_bylabel(application_train, 'NAME_EDUCATION_TYPE', size=[15 ,6], xlabel_angle=15, title='Education Type')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "373376e30f2ab5f9cc794ea5ff7bc223b0f9bad3"
   },
   "source": [
    "### <a id='3-4'>3.4 Numerical features by label</a>\n",
    "#### EXT_SOURCE_1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "_uuid": "4d565e0409e8108d03d9d39743debaa7b989c6ee",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def plot_numerical_bylabel(data, col, size=[12, 6]):\n",
    "    plt.figure(figsize=size)\n",
    "    # Calculate the correlation coefficient between the new variable and the target\n",
    "    corr = data['TARGET'].corr(data[col])\n",
    "    \n",
    "    # Calculate medians for repaid vs not repaid\n",
    "    avg_repaid = data.ix[data['TARGET'] == 0, col].median()\n",
    "    avg_not_repaid = data.ix[data['TARGET'] == 1, col].median()\n",
    "    \n",
    "    plt.figure(figsize = (12, 6))\n",
    "    \n",
    "    # Plot the distribution for target == 0 and target == 1\n",
    "    sns.kdeplot(data.ix[data['TARGET'] == 0, col], label = 'TARGET == 0')\n",
    "    sns.kdeplot(data.ix[data['TARGET'] == 1, col], label = 'TARGET == 1')\n",
    "    \n",
    "    # label the plot\n",
    "    plt.xlabel(col); plt.ylabel('Density'); plt.title('%s Distribution' % col)\n",
    "    plt.legend();\n",
    "    # print out the correlation\n",
    "    print('The correlation between %s and the TARGET is %0.4f' % (col, corr))\n",
    "    # Print out average values\n",
    "plot_numerical_bylabel(application_train, 'EXT_SOURCE_1')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "a91dd5806bfb023571c1125302ecc6e55b27df47"
   },
   "source": [
    "#### EXT_SOURCE_2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 159,
   "metadata": {
    "_uuid": "5a95dc911af4e905d03ee7cf6e23109889b6c49a",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_numerical_bylabel(application_train, 'EXT_SOURCE_2')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "3a76fb34fb99de560d5792bff2ba742db13b5a42"
   },
   "source": [
    "#### EXT_SOURCE_3"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "_uuid": "c0c050338883706991082501363e7ce11a4f8ebd",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "plot_numerical_bylabel(application_train, 'EXT_SOURCE_3')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "0f760891270b5984b5dc4f9afd66e54be3a1693d"
   },
   "source": [
    " ### <a id='3-5'>3.5 Correlation Matrix</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 168,
   "metadata": {
    "_uuid": "bb56720b489eea92235ee1b84927d1f9a4c9aa34",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "corr_mat = application_train.corr()\n",
    "plt.figure(figsize=[15, 15])\n",
    "sns.heatmap(corr_mat.values, annot=False)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "778a37b05840c021288b5d80764cb140f513f609"
   },
   "source": [
    " ## <a id='4'>4 A further exploration on application table</a>\n",
    " ### <a id='4-1'>4.1 Impute missing values</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 194,
   "metadata": {
    "_uuid": "57a971e9536f92635dcbc81cdbed1f258c876e11",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "from sklearn.impute import SimpleImputer, MICEImputer\n",
    "application_train = pd.read_csv('../input/application_train.csv')\n",
    "application_test= pd.read_csv('../input/application_test.csv')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "4b62191466878351ae3007af825374ae3278a1b4"
   },
   "source": [
    "### split categorical, discrete and numerical features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 195,
   "metadata": {
    "_uuid": "10c68a3de81f85b37bd91c459f1dd5965559b89c",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def feature_type_split(data, special_list=[]):\n",
    "    cat_list = []\n",
    "    dis_num_list = []\n",
    "    num_list = []\n",
    "    for i in data.columns.tolist():\n",
    "        if data[i].dtype == 'object':\n",
    "            cat_list.append(i)\n",
    "        elif data[i].nunique() < 25:\n",
    "            dis_num_list.append(i)\n",
    "        elif i in special_list:     # if you want to add some special cases\n",
    "            dis_num_list.append(i)\n",
    "        else:\n",
    "            num_list.append(i)\n",
    "    return cat_list, dis_num_list, num_list\n",
    "cat_list, dis_num_list, num_list = feature_type_split(application_train, special_list=['AMT_REQ_CREDIT_BUREAU_YEAR']) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 196,
   "metadata": {
    "_uuid": "4b64a4bfd519929a150fe35a15e82f60b6be2e69",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(str(len(cat_list)),'categorical features:', cat_list)\n",
    "print('-----------------------------------------')\n",
    "print(str(len(dis_num_list)),'discrete numerical features:',dis_num_list)\n",
    "print('-----------------------------------------')\n",
    "print(str(len(num_list)),'continuous numerical features:',num_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "ca738d66aa0ea7e506e45af5362306d7b5553fb3"
   },
   "source": [
    "### convert categorical using LabelEncoder"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 198,
   "metadata": {
    "_uuid": "f3ec8253742e7febec3f7ff0fe29d290f2df37ce",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "print(application_train.shape)\n",
    "print(application_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "c8e17d731269975070541d7798ea2df0206b3e7a"
   },
   "source": [
    "### impute missing values\n",
    "- for categorical and discrete features: use **'mode'** in SimpleImputer\n",
    "- for continuous features: use [MICEImputer](http://scikit-learn.org/dev/modules/generated/sklearn.impute.MICEImputer.html) with **median** as initial strategy "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 199,
   "metadata": {
    "_uuid": "b23b1988478ade7b7ea9ad55e404fcadda30fd3d",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# discrete\n",
    "# start = time.time()\n",
    "# application_train[cat_dummy_list] = SimpleImputer(strategy='most_frequent').fit_transform(application_train[cat_dummy_list])\n",
    "# application_train[dis_num_list] = SimpleImputer(strategy='most_frequent').fit_transform(application_train[dis_num_list])\n",
    "# application_test[cat_dummy_list]  = SimpleImputer(strategy='most_frequent').fit_transform(application_test[cat_dummy_list])\n",
    "# application_test[dis_num_list]  = SimpleImputer(strategy='most_frequent').fit_transform(application_test[dis_num_list])\n",
    "# end = time.time()\n",
    "# print('discrete imputing time:', str(round((end - start)/60)), 'mins')\n",
    "# # continuous \n",
    "# start = time.time()\n",
    "# application_train[num_list] = MICEImputer(initial_strategy='median', n_imputations=50, n_nearest_features=20, verbose=True).fit_transform(application_train[num_list])\n",
    "# application_test[num_list]  = MICEImputer(initial_strategy='median', n_imputations=50, n_nearest_features=20, verbose=True).fit_transform(application_test[num_list])\n",
    "# end = time.time()\n",
    "# print('continuous imputing time:', str(round((end - start)/60)), 'mins')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "b1945b08655293f6b20d1d9ca274d7ea0eff2cdb"
   },
   "source": [
    " ### <a id='4-2'>4.2 Create more features</a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "2718d9792cd694c90434f83703de95961813891c"
   },
   "source": [
    "### Term: total credit / annuity"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 200,
   "metadata": {
    "_uuid": "116315677c628ccb9d4086b313eda5bd14af201a",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# some new features\n",
    "application_train['TERM'] = application_train.AMT_CREDIT / application_train.AMT_ANNUITY\n",
    "application_test['TERM'] = application_test.AMT_CREDIT / application_test.AMT_ANNUITY"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "867ae3d19a266a9816a8bb232dffffa34d95cf14"
   },
   "source": [
    "### OVER_EXPECT_CREDIT: actual credit larger than goods price"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 201,
   "metadata": {
    "_uuid": "304763d22634fefb90f01b61ab092b04191ec7f1",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train['OVER_EXPECT_CREDIT'] = (application_train.AMT_CREDIT > application_train.AMT_GOODS_PRICE).map({False:0, True:1})\n",
    "application_test['OVER_EXPECT_CREDIT'] = (application_test.AMT_CREDIT > application_test.AMT_GOODS_PRICE).map({False:0, True:1})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "1a75e13858e377a46c4ab5999cf1face99638f5f"
   },
   "source": [
    "### MEAN_BUILDING_SCORE_TOTAL: the sum of all building AVG score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 202,
   "metadata": {
    "_uuid": "c070e1915f90885999c4f9bc46264893d79d83e7",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train['MEAN_BUILDING_SCORE_AVG'] = application_train.iloc[:, 44:58].mean(skipna=True, axis=1)\n",
    "application_train['TOTAL_BUILDING_SCORE_AVG'] = application_train.iloc[:, 44:58].sum(skipna=True, axis=1)\n",
    "application_test['MEAN_BUILDING_SCORE_AVG'] = application_test.iloc[:, 44:58].mean(skipna=True, axis=1)\n",
    "application_test['TOTAL_BUILDING_SCORE_AVG'] = application_test.iloc[:, 44:58].sum(skipna=True, axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "10d8d8196e3395f585c25b3243be29205f77c9b7"
   },
   "source": [
    "### FLAG_DOCUMENT_TOTAL: the total number of provided document"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 203,
   "metadata": {
    "_uuid": "69771476055d987c02cf360e6900fe3d2bdcc21e",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train['FLAG_DOCUMENT_TOTAL'] = application_train.iloc[:, 96:116].sum(axis=1)\n",
    "application_test['FLAG_DOCUMENT_TOTAL'] = application_test.iloc[:, 96:116].sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "bf46eaa211a46e3d3733ff20a0f591205edb48ca"
   },
   "source": [
    "### AMT_REQ_CREDIT_BUREAU_TOTAL: the total number of enquiries"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 204,
   "metadata": {
    "_uuid": "2901da74cb3f2b79ebdfa506d20b062bae54231a",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train['AMT_REQ_CREDIT_BUREAU_TOTAL'] = application_train.iloc[:, 116:122].sum(axis=1)\n",
    "application_test['AMT_REQ_CREDIT_BUREAU_TOTAL'] = application_test.iloc[:, 116:122].sum(axis=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "3b311d34ab3837e1632693fb5492dcb20f340fe0"
   },
   "source": [
    "### BIRTH_EMPLOTED_INTERVEL: the days between born and employed"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 205,
   "metadata": {
    "_uuid": "cdca8087ebffa926f952a54cae03f1986ab5073f",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train['BIRTH_EMPLOTED_INTERVEL'] = application_train.DAYS_EMPLOYED - application_train.DAYS_BIRTH\n",
    "application_test['BIRTH_EMPLOTED_INTERVEL'] = application_test.DAYS_EMPLOYED - application_test.DAYS_BIRTH"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "7589f8351523bda5084fa659e3b3f8d8c25f7486"
   },
   "source": [
    "## More features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "_uuid": "827e6fe1bee5efa30c7d8ed427f0321fede75090",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_train.ORGANIZATION_TYPE = application_train.ORGANIZATION_TYPE.map(lambda x: x[:str.find(x,'Type')-1] if str.find(x,'Type')>0 else x)\n",
    "application_train.ORGANIZATION_TYPE = application_train.ORGANIZATION_TYPE.map(lambda x: x[:str.find(x,'type')-1] if str.find(x,'type')>0 else x)\n",
    "application_test.ORGANIZATION_TYPE = application_test.ORGANIZATION_TYPE.map(lambda x: x[:str.find(x,'Type')-1] if str.find(x,'Type')>0 else x)\n",
    "application_test.ORGANIZATION_TYPE = application_test.ORGANIZATION_TYPE.map(lambda x: x[:str.find(x,'type')-1] if str.find(x,'type')>0 else x)\n",
    "application_train['MEAN_BUILDING_SCORE_AVG'] = application_train.iloc[:, 44:58].mean(skipna=True, axis=1)\n",
    "application_train['TOTAL_BUILDING_SCORE_AVG'] = application_train.iloc[:, 44:58].sum(skipna=True, axis=1)\n",
    "application_test['MEAN_BUILDING_SCORE_AVG'] = application_test.iloc[:, 44:58].mean(skipna=True, axis=1)\n",
    "application_test['TOTAL_BUILDING_SCORE_AVG'] = application_test.iloc[:, 44:58].sum(skipna=True, axis=1)\n",
    "application_train['INCOME_PER_FAMILY_MEMBER'] = application_train.AMT_INCOME_TOTAL / application_train.CNT_FAM_MEMBERS\n",
    "application_test['INCOME_PER_FAMILY_MEMBER'] = application_test.AMT_INCOME_TOTAL / application_test.CNT_FAM_MEMBERS\n",
    "application_train['SEASON_REMAINING'] = application_train.AMT_INCOME_TOTAL/4 -  application_train.AMT_ANNUITY\n",
    "application_test['SEASON_REMAINING'] = application_test.AMT_INCOME_TOTAL/4 -  application_test.AMT_ANNUITY\n",
    "application_train['RATIO_INCOME_GOODS'] = application_train.AMT_INCOME_TOTAL -  application_train.AMT_GOODS_PRICE\n",
    "application_test['RATIO_INCOME_GOODS'] = application_test.AMT_INCOME_TOTAL -  application_test.AMT_GOODS_PRICE\n",
    "application_train['CHILDREN_RATIO'] = application_train['CNT_CHILDREN'] / application_train['CNT_FAM_MEMBERS']\n",
    "application_test['CHILDREN_RATIO'] = application_test['CNT_CHILDREN'] / application_test['CNT_FAM_MEMBERS']\n",
    "# convert categorical variables to numericals\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "def label_encoder(input_df, encoder_dict=None):\n",
    "    \"\"\" Process a dataframe into a form useable by LightGBM \"\"\"\n",
    "    # Label encode categoricals\n",
    "    categorical_feats = input_df.columns[input_df.dtypes == 'object']\n",
    "    for feat in categorical_feats:\n",
    "        encoder = LabelEncoder()\n",
    "        input_df[feat] = encoder.fit_transform(input_df[feat].fillna('NULL'))\n",
    "    return input_df, categorical_feats.tolist(), encoder_dict\n",
    "application_train, categorical_feats, encoder_dict = label_encoder(application_train)\n",
    "application_test, categorical_feats, encoder_dict = label_encoder(application_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "60535bb09296d5fe6ed63a32e67afbddce7b5598"
   },
   "source": [
    "### <a id='4-3'>4.3 Train model</a>\n",
    "\n",
    "**application**: 'binary' for binary classification\n",
    "\n",
    "**num_iterations**: number of boosting iterations/trees, **n_estimators** in sklearn\n",
    "\n",
    "**learning_rate**\n",
    "\n",
    "**num_leaves**: number of leaves in one tree\n",
    "\n",
    "**feature_fraction**: part of features used for each iteration\n",
    "\n",
    "**bagging_fraction**: part of data used for each iteration\n",
    "\n",
    "**lambda_l1/lambda_l2**: L1/L2 regularization\n",
    "\n",
    "**min_split_gain**: the minimun gain to perform a split\n",
    "\n",
    "**early_stopping_round**: if the validation metric can't improve for n rounds, stop iteration\n",
    "\n",
    "**categorical_feature**: LightGBM API can deal with categorical feature automatically, **but we need transform string into integer**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "_uuid": "3475d544cea7e0aee6b82f795a8b141805d77dd4",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "from lightgbm import LGBMClassifier\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.model_selection import StratifiedKFold"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 207,
   "metadata": {
    "_uuid": "a28d1017430cfc9a480b038ca12e0f98fb6db0a4",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "X = application_train.drop(['SK_ID_CURR', 'TARGET'], axis=1)\n",
    "y = application_train.TARGET\n",
    "X_pred = application_test.drop(['SK_ID_CURR'], axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 215,
   "metadata": {
    "_uuid": "058832e38bd587f596b498011a9e873a5078f53b",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "folds = StratifiedKFold(n_splits=5,random_state=6)\n",
    "oof_preds = np.zeros(X.shape[0])\n",
    "sub_preds = np.zeros(X_pred.shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 216,
   "metadata": {
    "_uuid": "899e64e9ff6a007d1a72f68afd72f9811fd68625",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "start = time.time()\n",
    "valid_score = 0\n",
    "for n_fold, (trn_idx, val_idx) in enumerate(folds.split(X, y)):\n",
    "    trn_x, trn_y = X.iloc[trn_idx], y[trn_idx]\n",
    "    val_x, val_y = X.iloc[val_idx], y[val_idx]    \n",
    "    \n",
    "    train_data = lgb.Dataset(data=trn_x, label=trn_y)\n",
    "    valid_data = lgb.Dataset(data=val_x, label=val_y)\n",
    "    \n",
    "    param = {'application':'binary','num_iterations':4000, 'learning_rate':0.05, 'num_leaves':32, 'feature_fraction':0.9, 'bagging_fraction':0.9,\n",
    "             'lambda_l1':0.1, 'lambda_l2':0.1, 'min_split_gain':0.01, 'early_stopping_round':100, 'max_depth':7, 'min_child_weight':40, 'metric':'auc'}\n",
    "    lgb_es_model = lgb.train(param, train_data, valid_sets=[train_data, valid_data], verbose_eval=50) \n",
    "    \n",
    "    oof_preds[val_idx] = lgb_es_model.predict(val_x, num_iteration=lgb_es_model.best_iteration)\n",
    "    sub_preds += lgb_es_model.predict(X_pred, num_iteration=lgb_es_model.best_iteration) / folds.n_splits\n",
    "    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(val_y, oof_preds[val_idx])))\n",
    "    valid_score += roc_auc_score(val_y, oof_preds[val_idx])\n",
    "\n",
    "print('valid score:', str(round(valid_score/folds.n_splits,4)))\n",
    "\n",
    "end = time.time()\n",
    "print('training time:', str(round((end - start)/60)), 'mins')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "dfd82bb1859ca400fc95cfeec0d2c4864602f173"
   },
   "source": [
    "### <a id='4-4'>4.4 Feature importance</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 210,
   "metadata": {
    "_uuid": "0743d574351dde56f2c4076b187199c25b76708e",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lgb.plot_importance(lgb_es_model, height=0.5, max_num_features=20, ignore_zero = False, figsize = (12,6), importance_type ='gain')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "_uuid": "53662147dc3ad3467847966f44c1a6cd4ddc1bd2"
   },
   "source": [
    "### <a id='4-5'>4.5 Prediction</a>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 213,
   "metadata": {
    "_uuid": "b615313333d57f89e0fa319810fc8282b085aa69",
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "application_test= pd.read_csv('../input/application_test.csv')\n",
    "output = pd.DataFrame({'SK_ID_CURR':application_test.SK_ID_CURR, 'TARGET': sub_preds})\n",
    "output.to_csv('only_application_pred.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python [default]",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
